<a href="https://www.nlpfromscratch.com?utm_source=notebook&utm_medium=nb-header"><center><img src="../assets/cover_image_PT1.png"></center></a>

# Introduction to Natural Language Processing and Python

Copyright, NLP from scratch, 2024.

[NLPfor.me](https://www.nlpfor.me)

------------

## Introduction 🎬
In this notebook, we will give an introduction to natural language processing and cover many of the fundamentals of working with text data in Python.

If you are unfamiliar with working with Jupyter, please follow the [directions for setting up a local python environment and working with Jupyter](./assets/working_with_jupyer.pdf). You may then download the notebook as an `.ipynb` file and run it in either Jupyter or JupyterLab.

This notebook covers the following topics:
- Python Fundamentals and Working with Strings in Python
- Regular Expressions and the [`re`](https://docs.python.org/3/library/re.html) module
- The [`pandas`](https://pandas.pydata.org/) library, DataFrames and string data in pandas

## Python Fundamentals and Working with Strings 🐍🧶



Python is a great language to learn as it is easy to pick up even for the non-technical beginner with no prior programming experience. This is largely due to its simple syntax and structure. For natural language processing, working with text data is easy to do with modules included in base Python, such as the `string` and `re` (regular expressions) modules, which we will introduce and work with here, but not cover exhaustively. There are also extensive text processing capabilities built into the [pandas](https://pandas.pydata.org/docs/user_guide/text.html) data science library, which will be the focus of the last section.

### Variables and Strings

Like all programming languages, Python has [variables](https://en.wikipedia.org/wiki/Variable_(computer_science)) which can hold values of different data types. It also has *primitive data types*, the fundamental data storage structures of the language. Unlike in lower-level languages like C, even primitive data types are stored as *objects*, which means they have associated functions built in (more formally, *methods*), as we will see with string variables shortly.

We can define any variable in python using the equals operator:

In [None]:
x = 11.3

In [None]:
print(x)

11.3


If we want to know the type of any object, we can either call the `type` function which is built in to base python, or alternatively, every object in python also has a `.__class__` attribute.

In [None]:
type(x)

float

In [None]:
x.__class__

float
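As a quick illustration, the sketch below checks the types of a few literal values. The variable names are made up for this example; `isinstance` is another built-in that tests whether a value is of a given type.

```python
# A few values of different built-in types (names are illustrative only)
an_int = 42
a_float = 11.3
a_string = "applesauce"
a_bool = True

for value in (an_int, a_float, a_string, a_bool):
    # type(value).__name__ gives the name of the type as a string
    print(repr(value), "->", type(value).__name__)

# isinstance checks whether a value is of a given type
print(isinstance(a_string, str))
```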

Now that we have a basic understanding of variables in Python, and since this is training for natural language processing, we should get working with text data 🙂 Python stores text as 'strings', so called because they are strings of single characters. Let's first define a string in Python:

In [None]:
my_string = "This is a string about applesauce."
print(my_string)

This is a string about applesauce.


In Jupyter, we can render text in the notebook using [Markdown](https://en.wikipedia.org/wiki/Markdown):

In [None]:
from IPython.display import Markdown

Markdown(my_string)

This is a string about applesauce.

There, that looks a little better. Long strings can be defined in Python using the triple quote (`"""`) to open and close the string, and can span multiple lines:

In [None]:
my_long_string = """This is my string of text data I'd like to work with. It contains letters, numbers such as 1234, punctuation \
such as commas, semicolons; other weird punctuation such as hashtags #, and also special characters such as \\n, \\r, and \\t, \
which represent linebreaks and tabs. I feel I should also mention applesauce."""

In [None]:
Markdown(my_long_string)

This is my string of text data I'd like to work with. It contains letters, numbers such as 1234, punctuation such as commas, semicolons; other weird punctuation such as hashtags #, and also special characters such as \n, \r, and \t, which represent linebreaks and tabs. I feel I should also mention applesauce.

A handy tool to know in Python is formatted strings, or [f-strings](https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals) as they are known. These allow you to format values stored in variables into text conveniently for display, or for programmatically generating output. F-strings are created by prefixing the quotes with an `f`, and variables to be included are placed between curly braces (`{`,`}`):

In [None]:
x = 1.337
y = 42
my_formatted_string = f'This is a formatted string. {x} is a float. {y} is an int.'

print(my_formatted_string)

This is a formatted string. 1.337 is a float. 42 is an int.
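F-strings also accept an optional format specifier after a colon, which controls how a value is displayed. The following is a small sketch; the variable names and values are invented for illustration.

```python
price = 1299.99
quantity = 3

# :.1f rounds to one decimal place
print(f"Unit price: {price:.1f}")

# :,.2f adds thousands separators and keeps two decimal places
print(f"Total: {price * quantity:,.2f}")

# Expressions can be evaluated inside the braces
print(f"{quantity} items, doubled: {quantity * 2}")
```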


### Strings as Arrays

Strings in python are a type of *array* - a sequence of stored values - and can be treated as such. Indexing a string variable will return the single character at that index:

In [None]:
# Full string
print(my_string)

# Character at position 4 in the string
print(my_string[3])

This is a string about applesauce.
s


Furthermore, we can subset a string by using a range for an index, and returning a *slice*. For example, if we just want the characters of the word applesauce, that would be characters 24 to 33:

In [None]:
my_string[23:33]

'applesauce'

<div class="alert alert-warning">
<b> ⚠ Indexing Quirks in Python ⚠ </b>
    
Note that python indexing is [zero-based](https://en.wikipedia.org/wiki/Zero-based_numbering), that is, the first element in an array in python is at index 0, **not** index 1.
    
Furthermore, indexing in Python is *inclusive* on the low side, but *exclusive* on the high side. That is, a slice created with index range `[4:9]` runs from the fifth character (index 4, since indexing is zero-based) up to, but not including, the character at index 9 (the tenth character). So `[4:9]` returns five characters: the fifth through the ninth.

Confusing, I know. This trips up even experienced python users.
</div>

Notice that the index range is *inclusive* of the first value, but *exclusive* of the last. Here, `my_string[23:33]` is characters 24 (index 23, since Python is zero-indexed) to 33 (index 32). This does take some getting used to.
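One way to keep the slicing rule straight is that the length of a slice is always the stop index minus the start index. The sketch below shows this, along with negative indices, which count from the end of the string. The variable names are made up for this example.

```python
text = "This is a string about applesauce."

start, stop = 23, 33
word = text[start:stop]

# The slice length is always stop - start
print(word, len(word) == stop - start)

# Negative indices count from the end: -1 is the last character
print(text[-1])

# The same word using negative indices (stopping at -1 drops the final period)
print(text[-11:-1])
```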

When indexing, the first or last value can be omitted to index from the beginning or to the end, respectively:

In [None]:
# Indexing from the beginning (omit first index)
# First 16 characters
print(my_string[:16])

This is a string


In [None]:
# Indexing to the end (omit last index)
# Characters 17 to end
print(my_string[16:])

 about applesauce.


We can find the length of any string (and other objects) in python using the built-in base function `len`:

In [None]:
len(my_string)

34

### Other Arrays in Python

An array is a type of data structure which is common across many different languages. Python has several built-in collection types: [lists](https://docs.python.org/3/library/stdtypes.html#list) and [tuples](https://docs.python.org/3/library/stdtypes.html#tuple) are ordered sequences that can be indexed by position like an array, while [dicts](https://docs.python.org/3/library/stdtypes.html#dict) are looked up by key and [sets](https://docs.python.org/3/library/stdtypes.html#set) are unordered and do not support indexing.

We can create any list and subset different indices or slices of it:

In [None]:
mylist = ['a', 111, False, 0, 55.4, 'applesauce']

# single element
print(mylist[3])

# slice
print(mylist[0:2])

0
['a', 111]


Python treats strings just as arrays of single characters, so subsetting strings is an equivalent operation:

In [None]:
mystring = "The rain in Spain"

print(mystring[4])

print(mystring[12:17])

r
Spain
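To compare the built-in collection types side by side, here is a short sketch with made-up values. Lists, tuples and strings support positional indexing and slicing, whereas a dict is looked up by key. Strings and tuples are also immutable, which the last few lines demonstrate.

```python
letters_list = ['a', 'b', 'c']          # list: ordered, mutable
letters_tuple = ('a', 'b', 'c')         # tuple: ordered, immutable
letters_str = "abc"                     # string: ordered, immutable
prices = {'apple': 0.5, 'mango': 1.25}  # dict: looked up by key

# Lists, tuples and strings all support positional indexing and slicing
print(letters_list[1], letters_tuple[1], letters_str[1])
print(letters_list[0:2], letters_tuple[0:2], letters_str[0:2])

# Dicts are indexed by key rather than by position
print(prices['mango'])

# Strings are immutable, so assigning to an index raises a TypeError
try:
    letters_str[0] = 'z'
except TypeError as err:
    print("TypeError:", err)
```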


### String Methods

Every string variable in Python is not a primitive (such as in languages like C), but actually an object of the string class. Strings have methods for common text-based operations that are very straightforward to use. For example, we can change text case using `.upper`, `.lower`, and `.title`:

In [None]:
# Upper case
my_string.upper()

'THIS IS A STRING ABOUT APPLESAUCE.'

In [None]:
# Lower case
my_string.lower()

'this is a string about applesauce.'

In [None]:
# Title case
my_string.title()

'This Is A String About Applesauce.'

We can also replace every occurrence of a substring within a given string with another substring, using the `.replace` method:

In [None]:
my_string.replace("i", "_EYE_")

'Th_EYE_s _EYE_s a str_EYE_ng about applesauce.'

To search for substrings within a given string, we can use the `.find` method. This will return the index of the first character of the first match:

In [None]:
my_string.find("applesauce")

23

This can then be used in combination with indexing as we saw before:

In [None]:
applesauce_index = my_string.find("applesauce")

print(applesauce_index)

my_string[applesauce_index:]

23


'applesauce.'
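One detail worth knowing is that `.find` returns `-1`, rather than raising an error, when the substring is not present. The sketch below (with an example sentence and search term chosen for illustration) shows this, along with the `in` operator for a simple membership test.

```python
sentence = "This is a string about applesauce."

# .find returns -1 (not an error) when the substring is absent
print(sentence.find("ketchup"))

# The `in` operator is a simple membership test
print("applesauce" in sentence)

# Guard against -1 before slicing, otherwise sentence[-1:] would be returned
index = sentence.find("ketchup")
result = sentence[index:] if index != -1 else ""
print(repr(result))
```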

### The Most Useful String Methods

By far, the most commonly used and useful string method is `.replace()` for replacing substrings:

In [None]:
unclean_text = "I love @pples@uce, it's the best sauce! That's what I thought."

# Replacing a single character
print(unclean_text.replace('@', 'a'))

# Replacing a substring
print(unclean_text.replace('thought', 'think'))

# Since string methods return another string, we can 'chain' methods
print(unclean_text.replace('@','a').replace('thought', 'think'))

I love applesauce, it's the best sauce! That's what I thought.
I love @pples@uce, it's the best sauce! That's what I think.
I love applesauce, it's the best sauce! That's what I think.


A close second is `.split()`, which uses a specified character as a delimiter, and returns a list of the substrings split by that character. For example, if we have a comma-separated row that we read from a CSV:

In [None]:
row = "apple, banana, cherry, mango"

row.split(",")

['apple', ' banana', ' cherry', ' mango']
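Notice the leading spaces left on most of the items above. A common follow-up, sketched here, is to strip whitespace from each piece with `.strip()`, or to split on the full delimiter including the space.

```python
row = "apple, banana, cherry, mango"

# Split on the comma, then strip surrounding whitespace from each piece
fruits = [item.strip() for item in row.split(",")]
print(fruits)

# Alternatively, split on the full delimiter including the space
print(row.split(", "))
```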

Conversely, we can rejoin a list of substrings together around a delimiting character using `.join`. Somewhat unintuitively, this method acts *on* the joining character and takes the list of substrings as input, so the joining character is specified first:

In [None]:
substrings = ['2023', '09', '01']
"-".join(substrings)

'2023-09-01'

Knowing all these string methods is useful for preprocessing and cleaning text data, for example, normalizing text by making it all lowercase and removing punctuation. In this case, we replace characters we wish to remove with the empty string, `''`:

In [None]:
sample_text = \
"""
This is a block of text with capitalization and also punctuation, including a semi-colon; oh, and a period.
"""

In [None]:
print(sample_text.lower().replace('.','').replace(',','').replace(';',''))


this is a block of text with capitalization and also punctuation including a semi-colon oh and a period
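Chaining one `.replace` call per punctuation mark becomes tedious as the list grows. One alternative, sketched below, uses `str.maketrans` and `.translate` together with `string.punctuation` from the standard library. The sample sentence is made up for this example. Note that `string.punctuation` also contains the hyphen and apostrophe, so this approach would strip those too.

```python
import string

sample = "This is a block of text, with punctuation; and a period."

# Build a translation table that deletes every ASCII punctuation character
remove_punctuation = str.maketrans('', '', string.punctuation)

cleaned = sample.lower().translate(remove_punctuation)
print(cleaned)
```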



## Regular Expressions *️⃣⁉

Regular expressions, often abbreviated as regex (*"reh·jeks"*), are a general computer science construct for doing advanced pattern matching. Regular expressions are a kind of language to flexibly describe patterns using different character classes and modifiers. They provide a way to define complex search patterns in text for search or find-and-replace operations.

In Python, regular expression functions are part of the base `re` module. Though it is part of base Python, we do still need to import it in order to use it:

In [None]:
import re

Now that we have `re` at our disposal, let's build our first regular expression. We want to match any digit in a string. In regex, this can be written as `[0-9]` which is interpreted as any digit from 0 to 9. Furthermore, we need to specify how many occurrences should be expected in the match. There are a few ways to do this in regex using *quantifiers*:
- `?` means zero or one match
- `*` means zero or more matches
- `+` means one or more matches
- `{n}` matches exactly *n* times.
- `{n,}` matches at least *n* times.
- `{n,m}` matches at least *n* and at most *m* times.
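To see the quantifiers in action, the sketch below applies each one to the digit class `[0-9]` and checks a few candidate strings with `re.fullmatch`, which succeeds only if the whole string matches the pattern. The candidate strings are arbitrary examples.

```python
import re

# Each pattern is paired with a few candidate strings to test against it
examples = [
    (r"[0-9]?", ["", "7", "77"]),
    (r"[0-9]*", ["", "7", "777"]),
    (r"[0-9]+", ["", "7", "777"]),
    (r"[0-9]{3}", ["77", "777", "7777"]),
    (r"[0-9]{2,}", ["7", "77", "7777"]),
    (r"[0-9]{2,3}", ["7", "77", "777", "7777"]),
]

for pattern, candidates in examples:
    # Keep only the candidates that the pattern matches in full
    matched = [c for c in candidates if re.fullmatch(pattern, c)]
    print(pattern, "matches", matched)
```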

Furthermore, there are multiple functions for working with regular expressions using `re`:
- `re.match`: is used to check if a pattern matches at the beginning of a string. It returns a match object if the pattern is found at the beginning of the string; otherwise, it returns `None`. This is usually used when you just want to check if a string starts with a specific pattern.
- `re.search`: is used to search for a pattern anywhere in a given string. It returns a match object if the pattern is found anywhere in the string; otherwise, it returns `None`. This is useful for when you want to find the first occurrence of a pattern within a string.
- `re.findall`: is used to find all occurrences of a pattern in a string. It returns a list of all matches found in the string. This is most useful for extracting all occurrences of a pattern from a string.
- `re.sub`: is used for find and replace of patterns in a string with a specified replacement string. It returns a new string where all occurrences of the pattern in the original string are replaced with the specified replacement string. Unlike using `.replace` with a regular string, here we can also replace substrings specified using regular expressions so this is much more flexible and powerful.
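The four functions above can be compared on a single input. This is a minimal sketch; the example sentence is invented for illustration.

```python
import re

text = "Order 66 shipped on day 12, order 7 is pending."
digits = r"[0-9]+"

# match: only looks at the start of the string, so this returns None
print(re.match(digits, text))

# search: first occurrence anywhere in the string
print(re.search(digits, text).group())

# findall: list of all non-overlapping matches
print(re.findall(digits, text))

# sub: replace every match with the given replacement
print(re.sub(digits, "#", text))
```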

Let's try it out! First we will search for a phone number: 3 digits, followed by a dash, followed by 4 digits, specified by the regex pattern `[0-9]{3}-[0-9]{4}`:

In [None]:
my_string = "Here is a string with a phone number in it: 867-5309"

# Prefix with 'r' to make a raw string, so backslashes in the pattern are passed to re unchanged
regex_pattern = r"[0-9]{3}-[0-9]{4}"

re.search(regex_pattern, my_string)

<re.Match object; span=(44, 52), match='867-5309'>

As shown above, `re.search` returns a `Match` object, which includes the start and end indices of the match. If we want to pull out the substring, we can use the span values contained therein:

In [None]:
match = re.search(regex_pattern, my_string)
match.span()

(44, 52)

And we can now use this to subset the original string and only pull out the phone number:

In [None]:
# Pull out the start and end indices
start_index, end_index = match.span()

# Substring
my_string[start_index:end_index]

'867-5309'
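Slicing with the span works, but `Match` objects also expose the matched text directly through `.group()`. The sketch below also checks for `None` first, since `re.search` returns `None` when nothing matches. The variable names are chosen for this example.

```python
import re

phone_text = "Here is a string with a phone number in it: 867-5309"
found = re.search(r"[0-9]{3}-[0-9]{4}", phone_text)

# re.search returns None when there is no match, so check before using it
if found is not None:
    print(found.group())               # the matched text
    print(found.start(), found.end())  # same values as found.span()
```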

Now let's do a more complicated example, where we pull out all the prices from a piece of copy. Here, we are interested in extracting values with a `$`, followed by one or more digits `[0-9]+`, followed by a period `.`, followed by exactly 2 digits, `[0-9]{2}`.

Hence, our new regex pattern should be:
- The dollar sign `$`. This is a special character, so we need to "escape" it by prepending it with a backslash: `\$`.
- One or more of the digits 0-9: `[0-9]+`
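- Followed by a period `.`. Note that an unescaped `.` in regex matches *any* character (except a newline), not only a literal period; a literal period is written `\.`
- Exactly two digits `[0-9]{2}`

So our final regex pattern is `\$[0-9]+.[0-9]{2}`. Let's try it out: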

In [None]:
website_copy = \
"""
Discover an incredible range of products at unbeatable prices on our retail website.
Whether you're looking for budget-friendly bargains or premium selections, we've got you covered.
Dive into our collection, where you can find fantastic deals like a set of stylish $0.99 smartphone accessories to elevate your tech game.
If you're in the market for high-quality home appliances, explore our kitchen section, where you can find appliances ranging from $199.99
for a sleek microwave oven to a luxurious $1299.99 espresso machine for coffee connoisseurs.
We offer a diverse selection of products to suit every budget, making shopping with us a delightful experience for all.
"""

In [None]:
# From above
regex_pattern = r"\$[0-9]+.[0-9]{2}"
matches = re.findall(regex_pattern, website_copy)
print("Matches:", matches)

Matches: ['$0.99', '$199.99', '$1299.99']


We can see above that we've successfully pulled out the different dollar values from the website copy. However, if we had more complicated and varied expressions, we'd need to work to make sure our regex captures all the different possible variations. For example, in the below, if we include prices with a comma delimiter, our current regex will not catch them:

In [None]:
website_copy_2 = \
"""
Fall in love with our delightful assortment of stuffed animals, where cuddly companions come in all price ranges.
For those seeking an adorable plush friend without breaking the bank, we offer charming options like a lovable teddy bear for just $9.99.
Looking to make a grand gesture or celebrate a special occasion? Our premium collection includes exquisite, handcrafted stuffed animals, such as a
majestic life-sized lion for $599.99 or a whimsical unicorn adorned with Swarovski crystals for a lavish $1,199.99.
This could also be written as $1199.99.
Whether you're on a budget or ready to splurge, our stuffed animal selection promises to bring joy and comfort to your life.
Shop now and find the perfect plush partner for every occasion.
"""

In [None]:
# From above
regex_pattern = r"\$[0-9]+.[0-9]{2}"
matches = re.findall(regex_pattern, website_copy_2)
print("Matches:", matches)

Matches: ['$9.99', '$599.99', '$1,19', '$1199.99']


We can see that it failed to capture the `$1,199.99` price correctly: the unescaped `.` matched the comma, so the match stopped at `$1,19`. We need to write a more flexible expression which optionally includes the comma delimiters.

We can update our pattern:
- The dollar sign `$`. This is a special character, so we need to "escape" it by prepending it with a backslash: `\$`.
- Zero or more of the digits 0-9: `[0-9]*`
- Followed by no comma or a single comma `\,?`. A comma is not a special character in regex, so the backslash is optional here, but escaping it is harmless.
- Then our original pattern again, now with the period escaped so that it only matches a literal period: one or more digits, followed by a period, followed by exactly two digits: `[0-9]+\.[0-9]{2}`


In [None]:
new_regex_pattern = r"\$[0-9]*\,?[0-9]+\.[0-9]{2}"
matches = re.findall(new_regex_pattern, website_copy_2)
print("Matches:", matches)

Matches: ['$9.99', '$599.99', '$1,199.99', '$1199.99']


Note that this regex is still not "perfect" and will capture expressions we may not want, such as when a comma appears without preceding digits:

In [None]:
website_copy_3 = \
"""
I love writing website copy, here is an incorrectly entered price: $,245.22
"""

In [None]:
new_regex_pattern = r"\$[0-9]*\,?[0-9]+\.[0-9]{2}"
matches = re.findall(new_regex_pattern, website_copy_3)
print("Matches:", matches)

Matches: ['$,245.22']


Writing highly specific regular expressions to capture exactly what is wanted or needed without any edge cases is a topic all its own. For example, no standard regular expression for capturing domain names exists, [given the variability thereof](https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch08s15.html).

In general, writing regex which are "good enough" to capture what is needed, and determining what this comprises, is part of the work.
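As one illustration of tightening a pattern, the sketch below requires that any comma be preceded by 1 to 3 digits and followed by exactly 3 digits, with plain digits as the alternative. This is only one possible approach, not a definitive price regex, and the sample text is made up for this example.

```python
import re

# Either: 1-3 digits followed by one or more ",ddd" groups, or plain digits;
# in both cases followed by a literal period and exactly two digits.
strict_price = r"\$(?:[0-9]{1,3}(?:,[0-9]{3})+|[0-9]+)\.[0-9]{2}"

samples = "Prices: $9.99, $599.99, $1,199.99, $1199.99 and a typo $,245.22"

# The malformed "$,245.22" is no longer matched
print(re.findall(strict_price, samples))
```

The `(?:...)` groups are non-capturing, so `re.findall` returns the full matched strings rather than the contents of the groups.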

### Activity: Prompting ChatGPT for Regex

One thing that ChatGPT (and other LLMs) are particularly good at is writing regular expressions! Let's give [ChatGPT](https://www.chatgpt.com) some prompts and see if it can come up with the correct regex for the following:
- Phone numbers
- Email addresses
- ISO Date Formats

You can test these out using [regex101](https://regex101.com/) afterward. How did it perform?

## Intro to Pandas 🐼

While a lot can be accomplished in base python, most of the base python data structures are not suitable for doing serious machine learning and natural language processing work.

As such, the vast majority of ML work in Python is done using the *data science stack*, or what I refer to as the "holy trinity of data science":
- the [numpy](https://numpy.org/) library for doing numerical computation (*i.e.* working with vectors and matrices)
- the [pandas](https://pandas.pydata.org/) library for data manipulation and working with structured data
- the [matplotlib](https://matplotlib.org/) library for visualizing data

In this section, we will work only with pandas, and as we shall see, it is built on top of numpy and also has matplotlib functionality integrated within it, which makes it possible to visualize data without needing to use the latter directly.

### Series and DataFrames

Pandas works with abstractions that should be familiar to any data practitioner: [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html), which correspond to columns of data composed of individual elements, and [DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which correspond to the familiar abstraction of tables composed of columns. Therefore, a DataFrame is composed of multiple Series, each making up a single column therein.

Each Series must store data of one and only one type, therefore each column of a pandas DataFrame must all contain values of the same data type.

This is all getting a bit abstract, so let's take a look at a simple example in the context of text data.

Traditionally, pandas is imported as `pd`, and the sub-modules, functions, and classes within it called from within. Let's create a new Series of text data:

In [None]:
import pandas as pd

# Text data
text_data = ['applesauce', 'beluga caviar', 'cobbler', 'dijon']

# Plunk into a series
text_series = pd.Series(text_data)

# Show
display(text_series)

0       applesauce
1    beluga caviar
2          cobbler
3            dijon
dtype: object

We can see that each element in a Series has an associated *index*. This corresponds roughly to the primary key (id) of a table in a database. Just as we can reference elements in any array in Python using their numeric index (as we saw with subsetting strings), we can also pull out individual elements and slices of a pandas Series using indexing:

In [None]:
# Remember that indexes in python start from 0
text_series[1]

'beluga caviar'

In [None]:
# Slice
text_series[1:3]

1    beluga caviar
2          cobbler
dtype: object

As we saw with regular indexing in python, it is *inclusive* on the low end, and *exclusive* on the high end, so the slice `[1:3]` returns elements 2 (since python is zero-indexed) up to, but not including, element 4 (at index 3).

Indices can be manipulated and need not be numeric (though they usually are). We can replace the index of a given Series:

In [None]:
text_series.index = ['a','b','c','d']
text_series

a       applesauce
b    beluga caviar
c          cobbler
d            dijon
dtype: object

We can then reference elements by using the new index:

In [None]:
text_series['a']

'applesauce'

Confusingly, when using ranges that are non-numeric, they are *inclusive* on both sides:

In [None]:
text_series['a':'c']

a       applesauce
b    beluga caviar
c          cobbler
dtype: object

Despite replacing the default index, we can still slice by position using numeric indexing (pandas also provides `.iloc` for explicit positional access):

In [None]:
text_series[0:3]

a       applesauce
b    beluga caviar
c          cobbler
dtype: object
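To make the difference between label-based and position-based selection explicit, pandas provides the `.loc` and `.iloc` indexers. The following sketch rebuilds the same Series under a different variable name (chosen for this example) and compares the two.

```python
import pandas as pd

foods = pd.Series(['applesauce', 'beluga caviar', 'cobbler', 'dijon'],
                  index=['a', 'b', 'c', 'd'])

# .loc selects by label; label slices include both endpoints
print(foods.loc['a':'c'].tolist())

# .iloc selects by position; position slices exclude the stop index
print(foods.iloc[0:3].tolist())

# Single elements work the same way
print(foods.loc['b'], '|', foods.iloc[1])
```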

Finally, most serious data science work in Python is done using the pandas library. In pandas, we work with [DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which are like tables in Excel or a database. Pandas has the `.str` accessor, which can efficiently apply base string methods to a column of data element-wise:

In [None]:
import pandas as pd

# Create a sample DataFrame with customer IDs
data = {'Customer_ID': ['C123', 'C456', 'C789', 'C101'],
        'Product': ['Widget', 'Gadget', 'Widget', 'Doodad']}
text_df = pd.DataFrame(data)

# Show (before)
display(text_df)

# Applying the .replace() method using the .str accessor
text_df['Customer_Type'] = text_df['Customer_ID'].str.replace('C', 'Type ')

# Show (after)
display(text_df)

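|   | Customer_ID | Product |
|---|-------------|---------|
| 0 | C123 | Widget |
| 1 | C456 | Gadget |
| 2 | C789 | Widget |
| 3 | C101 | Doodad |

|   | Customer_ID | Product | Customer_Type |
|---|-------------|---------|---------------|
| 0 | C123 | Widget | Type 123 |
| 1 | C456 | Gadget | Type 456 |
| 2 | C789 | Widget | Type 789 |
| 3 | C101 | Doodad | Type 101 |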
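The `.str` accessor exposes more than `.replace`. The sketch below shows a few other methods on a copy of the same sample data: `.str.upper`, `.str.contains` for building a boolean mask, and `.str.extract` for pulling out a regex capture group. The new column names are invented for this example.

```python
import pandas as pd

products = pd.DataFrame({'Customer_ID': ['C123', 'C456', 'C789', 'C101'],
                         'Product': ['Widget', 'Gadget', 'Widget', 'Doodad']})

# Base string methods applied element-wise
products['Product_Upper'] = products['Product'].str.upper()

# Boolean mask: which product names contain the substring "dget"
products['Has_dget'] = products['Product'].str.contains('dget')

# Regex capture group: pull the digits out of the customer ID
products['ID_Number'] = products['Customer_ID'].str.extract(r'([0-9]+)', expand=False)

print(products)
```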


### Reading Data with Pandas

You can read in existing data in many different formats with pandas. Let's use `pd.read_csv`, by far the most common method for doing so, to read in a CSV file from the NLP from scratch [datasets repo](https://github.com/nlpfromscratch/datasets) on GitHub:

In [None]:
# Clone the repo on local drive of the notebook instance
!git clone https://github.com/nlpfromscratch/datasets

Cloning into 'datasets'...
remote: Enumerating objects: 70, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 70 (delta 14), reused 61 (delta 8), pack-reused 0
Receiving objects: 100% (70/70), 34.61 MiB | 18.02 MiB/s, done.
Resolving deltas: 100% (14/14), done.
Updating files: 100% (27/27), done.


In [None]:
# Read in the file with pandas
emoji_df = pd.read_csv('datasets/emoji_sms/train.csv')

# Show a sample of the data
emoji_df.head()

|   | text |
|---|------|
| 0 | 🌟 Hey there! Sending you a wave of 🌊 positivit... |
| 1 | 🎉 Just wanted to remind you that you're amazin... |
| 2 | 🌻 Rise and shine! It's a new day full of oppor... |
| 3 | 🍀 Wishing you luck and a day filled with joy, ... |
| 4 | 🤗 Hey, just dropping by to say hi and sending ... |


Note that pandas can read CSV files (and other types of data) which are hosted online directly by passing the [URL as the filepath](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas-read-csv):

In [None]:
# Read directly from url / online source
emoji_df = pd.read_csv('https://raw.githubusercontent.com/nlpfromscratch/datasets/master/emoji_sms/train.csv')

# Show
emoji_df.head()

|   | text |
|---|------|
| 0 | 🌟 Hey there! Sending you a wave of 🌊 positivit... |
| 1 | 🎉 Just wanted to remind you that you're amazin... |
| 2 | 🌻 Rise and shine! It's a new day full of oppor... |
| 3 | 🍀 Wishing you luck and a day filled with joy, ... |
| 4 | 🤗 Hey, just dropping by to say hi and sending ... |
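`pd.read_csv` also accepts file-like objects, which is handy for experimenting without a file on disk or a network connection. Here is a minimal sketch using `io.StringIO`; the CSV content and column names are hypothetical and made up for this example.

```python
import io
import pandas as pd

# Hypothetical CSV content, made up for this example
csv_text = """id,text
0,Hello there
1,"A message, with a comma"
2,Goodbye
"""

# read_csv accepts file paths, URLs, and file-like objects such as StringIO
messages_df = pd.read_csv(io.StringIO(csv_text), index_col='id')

print(messages_df.head())
print(messages_df.shape)
```

Fields containing the delimiter are wrapped in double quotes in the CSV, and pandas unwraps them when reading.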


## Conclusion
That concludes the workshop! I hope you've enjoyed getting started with the Python programming language and natural language processing. We will continue next week with acquiring

----

<table border="0" bgcolor="white">
  <tr></tr>
  <tr>
      <th align="left" style="align:left; vertical-align: bottom;"><p>Copyright NLP from scratch, 2024.</p></th>
      <th align="right" width="33%"><a href="https://www.nlpfromscratch.com?utm_source=notebook&utm_medium=nb-footer-img"><img src="../assets/banner.png"></a></th>
</tr>
</table>